Unbiased sampling of globular lattice proteins in three dimensions 
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We present a Monte Carlo method that allows efhcient and unbiased sampling of Hamiltonian 
walks on a cubic lattice. Such walks are self-avoiding and visit each lattice site exactly once. They 
are often used as simple models of globular proteins, upon adding suitable local interactions. Our 
algorithm can easily be equipped with such interactions, but we study here mainly the flexible 
honiopolymer case where each conformation is generated with uniform probability. We argue that 
the algorithm is ergodic and heis dynamical exponent z = 0. We then use it to study polymers of 
size up to 64^ = 262 144 monomers. Results are presented for the effective interaction between end 
points, and the interaction with the boundaries of the system. 
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Self-avoiding random walks of various types are ubiq- 
uitous in Nature and often serve as a first step in mod- 
eling polymeric systems [ij. In a good solvent, polymers 
assume spatially extended conformations which are well 
described by the usual Self- Avoiding Walk (SAW) model. 
In this model, each monomer is assigned a fugacity /3, and 
the SAW is recovered at the smallest value /3c of /3 so that 
the mean length of the walk diverges. This is a critical 
point, meaning that observables exhibit power law be- 
haviors that are independent of the precise microscopic 
model. Therefore, the latter is conveniently defined on a 
hypercubic lattice in d dimensions. 

By contrast, proteins in their native globular state 
assume compact confirmations that occupy space as 
densely as possible. This suggests modeling such biopoly- 
mers as Hamiltonian Walks (HW) ; by definition these are 
self-avoiding walks which are constrained to visit each of 
the lattice sites exactly once. Upon adding further lo- 
cal interactions, the HW model has been proposed as a 
description of protein melting (with bending rigidity), 
or of protein folding [3| (with suitable interactions among 
amino acids). The simplest example of the latter is the 
so-called HP model in which only two types of amino 
acids (Hydrophobic or Polar) are taken into account. 

In d = 2 dimensions an amazing number of exact re- 
sults on self-avoiding walk models are known, due in par- 
ticular to the application of Conformal Field Theory (see 
[3| for a recent review) . For instance, both the HW model 
and the Flory model of protein melting have been solved 
0. But in the physically more relevant case of d = 3, 
such exact results are not available, and one must resort 
to approximate techniques, or to numerics, to extract in- 
formation about the model at hand. 

Much work on such lattice proteins is based on exact 
enumeration studies in which all possible conformations 
on a small cube of size N — L x L x L are generated. 
Unfortunately, despite of the HW constraint, the number 
of conformations is so large that even the case L = 4 is 
out of reach Needless to say, such small sizes are 
very sensitive to the peculiarities of the lattice model 
(e.g., the choice of lattice), and cannot be considered 



representative of real proteins. 

Therefore, an efficient and unbiased importance sam- 
pling scheme of the Monte Carlo (MC) type would be 
most welcome to study larger L and make precise state- 
ments about the thermodynamic limit L oo. For the 
usual SAW case, this is furnished by the so-called pivot 
algorithm Q- In each move, a randomly chosen lattice 
symmetry is applied to the part of the walk subsequent 
to a randomly chosen pivot point. The move is accepted 
if the resulting walk is self- avoiding. For TV — > oo, the ac- 
ceptance ratio goes to zero, but the few accepted moves 
suffice to decorrelate the system in time ^ TV. But in 
the HW limit, the acceptance ratio becomes identically 
zero, and so the pivot algorithm is useless in this case. 
Alternative growth-type algorithms are not guaranteed 
to yield unbiased sampling [8|. 

In this Letter we propose an MC algorithm which is 
specifically adapted to HW, and we test it in details for 
the d = 3 flexible homopolymer case. We show that it 
satisfies detailed balance and give evidence for ergodic- 
ity; each conformation is therefore generated with uni- 
form probability. In contrast to the pivot algorithm, 
each move is accepted with probability one. One move 
takes time TV, and TV moves are sufficient to completely 
decorrelate the system. Using our algorithm, we gener- 
ate conformations of size TV = 262 144 monomers with 
ample statistics, and give a number of results on their 
geometrical properties. 

Algorithm. Let C be a Hamiltonian walk on the cubic 
lattice. Choose one of its two end points randomly and 
uniformly; call it x. Note that site x is adjacent to one 
occupied and five empty links. Choose one of the five 
empty links with uniform probability; call it (xy). Then 
among the occupied links adjacent to y there is exactly 
one which forms part of a loop in C U (xy); call it (yz). 
[Note that if y is the other end point of C there is one 
occupied link adjacent to y; otherwise there are two.] 
Change C by adding (xy) and removing (yz). 

The MC move just described is illustrated in Fig. [1] 
It was studied in two dimensions in In what follows 
we choose the system to be confined to an cube with 
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FIG. 1: Move between two of the 2 480 304 possible Hamilto- 
nian walks on a 3 x 3 x 3 cubic lattice. 

free (i.e., non periodic) boundary conditions. A small 
modification is then needed if y is outside the cube. In 
that case, the move consists in leaving C unchanged. 

Unbiased sampling. The above move clearly sat- 
isfies detailed balance. To prove that it is also ergodic 
it suffices to show that any C can be transformed into 
a fixed reference configuration Cq in a finite number of 
moves. Let Cq be one of the two configurations with a 
maximal number of occupied horizontal links. One may 
approach Cq by choosing always y so that (xy) is hori- 
zontal. Unfortunately, (yz) may also turn out to be hor- 
izontal, so positive progress is not made in every step. 
Moreover, if both end points end up on the left or right 
side of the cube, and are both adjacent to an occupied 
horizontal link, one will actually have to remove a hor- 
izontal link to proceed. Due to these complications we 
cannot prove ergodicity for arbitrary size L. 

However, we can argue in favor of ergodicity by testing 
it for small lattices. Using exact enumeration techniques, 
we find that there are 2 480 304 HW on an L = 3 cube 
[lo| . and our MC algorithm generates all of them. We 
also find that there are 3 918 744 Hamiltonian Circuits 
(HC), or closed walks, on a 3 x 3 x 4 parallelepiped [lot . 
A HC can be formed from a HW whose end points are 
nearest neighbors on the lattice, by adding the missing 
link. In this way we have checked that all 3 918 744 HC 
can be generated as well. Similar checks have been made 
on smaller parallelepipeds. 

Assuming that ergodicity holds in general, we conclude 
that the MC process converges towards the equilibrium 
distribution, i.e., that all Hamiltonian walks are sampled 
with uniform probability. 

Performance. Identifying the correct edge (yz) to be 
removed necessitates tracing out the loop formed when 
adding xy, and so one move takes a time ~ N. But note 
that when the end point moves from x to z, the part of 
the chain [xzy] is reversed and becomes [zxy]. Thus, if 
one were to use the algorithm to study a heteropolymer 
problem, the sequence of amino acids on that part would 
have to be reversed as well. So in that respect, and in 
terms of chain connectivity, the move is certainly more 
non-local than it looks at first sight. 

Define now the autocorrelation function G{t) as the 



link overlap function with respect to some reference ini- 
tial configuration. The time t is measured in units of 
MC moves per site, and averages are done over 100 inde- 
pendent runs. We have measured G{t) for system sizes 
L = 4, 8, 16, 32 and found that it decays exponentially 
towards an L-dependent constant, which can be deter- 
mined numerically by averaging G{t) over the last 100 
time units in a sufficiently long simulation. Subtracting 
this constant, and rescaling, gives a normalized autocor- 
relation function G{t) satisfying G(0) = 1 and G{t) — > 
as t — > oo. Defining now the autocorrelation time r by 
G{t + t) = ^G{t), we estimate r by averaging over the 
first 4r of the simulation. For the system sizes mentioned 
this gives t = 1.33, 1.27, 1.20, and 1.15. This fits nicely 
as r ~ with dynamical exponent z = — 0.024 ± 0.001. 
Of course we cannot have z < 0, since one MC move 
changes only a finite number (two) of links. We there- 
fore conclude that z « 0. 

Simulations. We have generated conformations on 
lattices up to size L = 64 (i.e., 262 144 monomers) with 
ample statistics. In all cases, the first t = 100 MC moves 
per site were discarded. The various averages and his- 
tograms were then made over T MC moves per site, 
with measurements being made after each individual MC 
move. We have taken T = 10*^ for L < 20, T = 10^ for 
20 < L < 30, T = 10* for 30 < L < 40, T = 10^ for 
40 < L < 60, and T = 10^ for L > 60. Data were gen- 
erated for both even and odd L in order to detect any 
parity effects. 

Asymptotic isotropy. We have measured the his- 
togram for the end-to-end distance vector xi2 = ±(xi — 
X2); note that the sign is immaterial because of the in- 
distinguishability of the end points. Not all values of X12 
are possible, since for N even (resp. odd) the end points 
must belong to different (resp. identical) sublattices by 
an easy parity argument. 

We first checked for L = 4 that the histogram is 
isotropic on the microscopic level. Thus, for each X12 re- 
specting the parity constraint, the 6, 8, 12 or 24 vectors 
related to it by lattice symmetries are found to be gener- 
ated with the same frequency (up to counting statistics 
fluctuations). 

Still, vectors X12 of equal length but unrelated by 
lattice symmetries need not have the same probability 
for finite L. Define for instance the probability ratio 
p = P(xi2 = (0,0,3))/P(xi2 = (1,2,2)). We find 
p ^ 1.642 for i = 4, and p = 1.160 for L = 16. A 
power law fit yields however p ^ 1.00 ± 0.01 for L 00, 
so isotropy is recovered in the thermodynamical limit. It 
is important for a realistic modeling that the large L be- 
havior does not depend on irrelevant microscopic details. 

End-to-end distance. We now construct a his- 
togram for the end-to-end distance X12 ~ |xi2 1 by sum- 
ming over all possible directions of X12 . The discrete his- 
togram is then smoothed in a two-step process. First, for 
each value of x^^2 ™ steps of 0.1, we regroup the counts 



3 



0.1 







' ' 1 


- 








8x8x8 


\ 




12x12x12 






16x16x16 






20x20x20 






24x24x24 






32x32x32 






36x36x36 






44x44x44 






48x48x48 






52x52x52 






64x64x64 






.... 1-85(2) 














, , 1 1 



FIG. 2: Normalized probability distribution of x^c = X\2/L, 
where a; 12 is the end-to-end distance. 

with \xi2 — x{2 \ < 0-5. Second, we construct a run- 
ning average over 20 subsequent x^2^ (i.e., two lattice 
spacings). Once properly normalized, the end result is 
the probability distribution (universal scaling function) 
of the rescaled variable cceo = X12/L. This is shown in 
Fig. [21 where (as in subsequent figures) the small cir- 
cles are the L — 2Q data points after the first step of 
the smoothing procedure. The data collapse is strikingly 
good over the entire ge. Before the onset of finite- 

size effects, we have a power law behavior 

P(Xec) ~ for Xec « 1 . (1) 

If the end points were independent we would have 
P{xee) ~ x^e^ by simplc geometry. The result ^ then 
indicates a weak effective attraction between end points. 

Conformational exponents. For a HW of length 
iV, the radius of gyration i?g ~ L, and the exponent z/ 
appearing in Rg ~ N'^ is trivially = ^ = -j- 

The exponent 7 describes the ratio between the num- 
ber of HW (~ fi^N"/-^) and HC (- N'"'^) passing 
through a fixed point. This leads to 

P{xi2 = l) ^ L-^"^ , 7 = 0.96 ±0.01. (2) 

Standard scahng [ij gives ^ — 1 — Ai and P(xee) ~ 
a^ecT^^^ , and comparing ((I])-© yields then our final esti- 
mate Ai = 0.06 ± 0.02. 

Surface effects. We next examine the interaction of 
the HW with the surfaces of the system. Although glob- 
ular proteins would be more realistically modeled within 
a spherical domain (where we expect our algorithm to 
be ergodic as well), we maintain here the geometry of a 
cube in order to have several types of surface sites. 

For a given end point, let ds, d^, and c?c be respec- 
tively its distance to the nearest side, edge, and corner of 
the cube. For each of these we construct the normalized 
probability distribution as above. [Since ds is an integer. 




FIG. 3: Normalized probability distribution of a::s = ds/L, 
where ds is the distance to the nearest side of the cube. 
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FIG. 4: Normalized probability distribution of = d^/L, 
where dE is the distance to the nearest edge of the cube. 



we do not apply the smoothing procedure in this case.] 
The results, shown in Figs. [3HS1 in all cases lead to very 
satisfactory data collapses. Power-law fits analogous to 
© give P{x^) ~ 472±o.o3 ^ 4-85±o o3. the 

geometric exponents for a non-interacting system would 
be 1 and 2 respectively. 

The behavior of xs = ds/L in Fig. [3] would seem 
in favor of P(a;s) — > p% as xa, 0, with a finite 
value pg. However, using only the data with dg = 0, 
we estimate P(ds = 0) ~ ]^-o.9i±o.02 ^ implying that 
P(a;s)-^4°™^ 

To check in more detail the affinities of the end points 
to the boundary, we divide the lattice sites in four classes: 

C: 8 corner sites with dc = 

E: 12{L - 2) edge sites with d^ = 0, but dc > 

S: 6(L - 2)2 side sites with dg = 0, but cJe > 

B: {L - 2f bulk sites with dg > 

Equivalently, the coordination number (number of links 
adjacent to a site) is 3 for C-sites, 4 for E-sites, 5 for 
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zero. For L = 19 wc find 



FIG. 5: Normalized probability distribution of xq = dc/L, 
where dc is the distance to the nearest corner of the cube. 
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FIG. 6: Normalized probabilities for an end point being a 
corner site (pc), an edge site (pe), a side site (ps), or a bulk 
site (pb), as functions of 1/L. 



S-sites, and 6 for B-sites. 

Wc normalize the probability pi of an end point be- 
longing to a given class i = C,E, S,B by the ratio of 
j-type sites on the lattice. These probabilities are shown 
as functions of 1/L in Fig. [S) Note that for even (resp. 
odd) i, a HW can link any given corner site to 3 (resp. 
7) other corners. Therefore, we plot pc for even L and 
\pc for odd L. With this convention, parity effects are 
seen to vanish as i ^ cxd, within error bars, and we find 
PC = 1.82 ± 0.02, pE = 1-21 ± 0.01, PS = 1.089 ± 0.002, 
andpB = l.OOOOibO.OOOl. Thus, the higher the coordina- 
tion number of a site, the more do the end points tend to 
avoid it (in our normalization). This is remarkable, since 
naively one might have expected the opposite to occur. 

One may similarly study the normalized probabilities 
Pij that both end points belong to prescribed classes i and 
j. From this we form the matrix of cross correlations A 
with entries Aj^ = — 1. Note that if the two end 

would be 



A = 



-0.100 -0.020 -0.006 
-0.020 -0.005 
-0.004 



0.004 
0.003 
0.002 
-0.001 



(3) 



Although the entries are numerically small, there is a 
clear qualitative effect. Namely, if one end point is at a 
site of small coordination number, the other end point 
will try to compensate by being at a site of high coordi- 
nation number. In particular, it is ten percent less likely 
to find both end points in corners than would have been 
expected from the data for just one end point. 

Conclusion. We have presented an algorithm that 
allows for efficient and unbiased sampling of three- 
dimensional lattice protein conformations (Hamiltonian 
walks). We applied it to the flexible homopolymer case, 
where each conformation has the same energy, and ex- 
tracted a number of results on the correlation between 
end points and their interaction with the boundaries of 
the system. The most salient features are 1) the weak 
entropic attraction between end points, 2) the attraction 
of end points towards the surface, edge sites in partic- 
ular, 3) their preference for sites with low coordination 
number, and 4) the vanishing of parity effects for L — > oo. 

We hope to study later topological issues, such as the 
knot formation probability. 

It is a straightforward extension of our algorithm to as- 
sociate an energy to each state and implement a standard 
Metropolis dynamics. This would allow to study models 
with bending rigidity, amino acid interactions, etc. 

A slightly modified algorithm in which end points are 
allowed to grow or retract can be applied to models of 
non-compact conformations. In the standard SAW case, 
it decorrelates a polymer of length N in time ~ N, not 
only globally (as does the pivot algorithm 0]) but also 
locally. We shall report more on this elsewhere. 
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